Body Mass Index (BMI), age, height and weight are important indicators of human health conditions, which can provide useful information for plenty of practical purposes, such as health care, monitoring and re-identification. Most existing methods of health indicator prediction mainly use front-view body or face images. These inputs are hard to be obtained in daily life and often lead to the lack of robustness for the models, considering their strict requirements on view and pose. In this paper, we propose to employ gait videos to predict health indicators, which are more prevalent in surveillance and home monitoring scenarios. However, the study of health indicator prediction from gait videos using deep learning was hindered due to the small amount of open-sourced data. To address this issue, we analyse the similarity and relationship between pose estimation and health indicator prediction tasks, and then propose a paradigm enabling deep learning for small health indicator datasets by pre-training on the pose estimation task. Furthermore, to better suit the health indicator prediction task, we bring forward Global-Local Aware aNd Centrosymmetric Encoder (GLANCE) module. It first extracts local and global features by progressive convolutions and then fuses multi-level features by a centrosymmetric double-path hourglass structure in two different ways. Experiments demonstrate that the proposed paradigm achieves state-of-the-art results for predicting health indicators on MoVi, and that the GLANCE module is also beneficial for pose estimation on 3DPW.
translated by 谷歌翻译
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference. Previous methods in unsupervised video object segmentation (UVOS) have demonstrated the effectiveness of motion as either input or supervision for segmentation. However, motion signals may be uninformative or even misleading in cases such as deformable objects and objects with reflections, causing unsatisfactory segmentation. In contrast, IMAS achieves Improved UVOS with Motion-Appearance Synergy. Our method has two training stages: 1) a motion-supervised object discovery stage that deals with motion-appearance conflicts through a learnable residual pathway; 2) a refinement stage with both low- and high-level appearance supervision to correct model misconceptions learned from misleading motion cues. Additionally, we propose motion-semantic alignment as a model-agnostic annotation-free hyperparam tuning method. We demonstrate its effectiveness in tuning critical hyperparams previously tuned with human annotation or hand-crafted hyperparam-specific metrics. IMAS greatly improves the segmentation quality on several common UVOS benchmarks. For example, we surpass previous methods by 8.3% on DAVIS16 benchmark with only standard ResNet and convolutional heads. We intend to release our code for future research and applications.
translated by 谷歌翻译
解开内容和口语样式信息对于零击的非并行语音转换(VC)至关重要。我们先前的研究调查了一个新型框架,该框架具有分离的顺序变分自动编码器(DSVAE),作为信息分解的骨干。我们已经证明,对于零拍的VC来说,同时解开嵌入的内容和嵌入的说话者是可行的。在这项研究中,我们通过提出对DSVAE基线中内容分支的先验分布的关注来继续方向。我们发现随机初始化的先验分布将迫使内容嵌入以减少学习过程中的语音结构信息,这不是所需的属性。在这里,我们试图获得更好的内容,并保留更多的语音信息。我们提出了条件DSVAE,这是一个新模型,可以使内容偏置作为先验建模的条件,并重塑从后分布中采样的内容。在我们在VCTK数据集的实验中,我们证明了从条件DSVAE中得出的内容嵌入克服了随机性,并获得了更好的音素分类精度,稳定的发声和与竞争性DSVAE基线相比,稳定的发声和更好的零击VC性能。
translated by 谷歌翻译
这项工作研究了伪标签的偏见问题,一种广泛发生的自然现象,但经常通过先前的研究忽视。当在源数据上培训的分类器被传送到未标记的目标数据时,会生成伪标签。当半监督的学习模型Fixmatch预测未标记的数据时,我们观察到沉重的长尾伪标签即使未标记的数据被策划到平衡。没有干预,培训模型继承了伪标签的偏置,最终是次优。为了消除模型偏置,我们提出了一种简单而有效的方法DebiSmatch,包括自适应脱叠模块和自适应边际损失。通过使用在线更新的队列,可以自动调整脱叠的强度和边距的大小。在ImageNet-1K上基准测试,DebiasMatch分别在半监督学习(0.2%注释数据)和零拍摄学习任务中显着超过26%和8.7%的最先进。
translated by 谷歌翻译
基于深度学习的模型占主导地位的生产推荐系统的当前景观。此外,近年来目睹了模型规模的指数增长 - 从谷歌的2016年模型,最新的Facebook的型号有10亿个参数,具有12万亿参数。型号容量的每次跳跃都有显着的质量增强,这使我们相信100万亿参数的时代即将来临。然而,即使在工业规模数据中心内,这些模型的培训也在挑战。这种困难是从训练计算的惊人的异质性继承 - 模型的嵌入层可以包括总模型尺寸的99.99%,这是极其内存密集的;虽然其余的神经网络越来越多地计算密集型。为支持培训此类巨大模式,迫切需要有效的分布式培训系统。在本文中,我们通过仔细共同设计优化算法和分布式系统架构来解决这一挑战。具体而言,为了确保培训效率和训练精度,我们设计一种新型混合训练算法,其中嵌入层和密集的神经网络由不同的同步机制处理;然后,我们构建一个名为Persia的系统(短暂的并行推荐培训系统,其中包含混合加速),以支持这种混合培训算法。理论上的示范和实证研究均达到100万亿参数,以证明了波斯的系统设计和实施。我们将Pensia公开使用(在https://github.com/persiamml/persia),以便任何人都能够以100万亿参数的规模轻松培训推荐模型。
translated by 谷歌翻译
我们研究了用于半监控学习(SSL)的无监督数据选择,其中可以提供大规模的未标记数据集,并且为标签采集预算小额数据子集。现有的SSL方法专注于学习一个有效地集成了来自给定小标记数据和大型未标记数据的信息的模型,而我们专注于选择正确的数据以用于SSL的注释,而无需任何标签或任务信息。直观地,要标记的实例应统称为下游任务的最大多样性和覆盖范围,并且单独具有用于SSL的最大信息传播实用程序。我们以三步数据为中心的SSL方法形式化这些概念,使稳定性和精度的纤维液改善8%的CiFar-10(标记为0.08%)和14%的Imagenet -1k(标记为0.2%)。它也是一种具有各种SSL方法的通用框架,提供一致的性能增益。我们的工作表明,在仔细选择注释数据上花费的小计算带来了大注释效率和模型性能增益,而无需改变学习管道。我们完全无监督的数据选择可以轻松扩展到其他弱监督的学习设置。
translated by 谷歌翻译
最近,神经技术已用于自动生成源代码。这些方法在有望获得声明语言的同时,在命令式语言的数据集上的性能差得多。由于通常将声明性语言嵌入了现实世界软件开发中的命令式语言(即Turducken式编程)中,因此声明语言的有希望的结果几乎不会导致手动软件开发工作大幅减少。在本文中,我们定义了一项新的代码生成任务:鉴于自然语言评论,此任务旨在用嵌入式声明语言以基本命令性语言生成程序。据我们所知,这是第一个Turducken风格的代码生成任务。对于此任务,我们将Lyra:Python中的数据集提出了嵌入式SQL。该数据集包含来自现实世界项目的2,000个精心注释的数据库操作程序。每个程序都与中文评论和英文评论配对。在我们的实验中,我们采用了变压器,伯特风格和GPT风格的模型作为基础。在最佳环境中,GPT风格模型的生成性能比其他模型更好,在使用中文和英语评论时,AST精确匹配的精度分别为24%和25.5%。因此,我们认为Lyra为代码生成提供了新的挑战。但是,克服这一挑战可能会大大提高代码生成技术在现实世界软件开发中的适用性。
translated by 谷歌翻译
最近,数据库管理系统(DBMS)社区目睹了机器学习(ML)解决方案的DBMS任务的能力。尽管表现明显,但这些现有解决方案几乎不会被认为是令人满意的。首先,DBMS中的基于ML的方法不够有效,因为它们在每个特定任务上进行了优化,并且无法探索或理解任务之间的内部连接。其次,培训过程具有严重的限制,妨碍他们的实用性,因为他们需要从划痕中恢复整个模型以获得新的dB。此外,对于每个再次,它们需要过多的训练数据,这对于新的DB来获得和不可用的非常昂贵。我们建议探讨ML方法跨任务和跨DBS的传递,以解决这些基本缺点。在本文中,我们提出了一个统一的模型MTMLF,它使用多任务培训程序来捕获任务的可转让知识和预先列车前的微调程序,以蒸馏出跨DBS的可转移元知识。我们认为,此范例更适合云DB服务,并且有可能彻底改变ML如何在DBMS中使用的方式。此外,为了证明MTMLF的预测力和可行性,我们提供了关于查询优化任务的具体和非常有希望的案例研究。最后但并非最不重要的是,我们沿着这一工作线讨论了几个具体的研究机会。
translated by 谷歌翻译
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and ({even more importantly}) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed as NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximizes the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer led to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
translated by 谷歌翻译
In the field of cross-modal retrieval, single encoder models tend to perform better than dual encoder models, but they suffer from high latency and low throughput. In this paper, we present a dual encoder model called BagFormer that utilizes a cross modal interaction mechanism to improve recall performance without sacrificing latency and throughput. BagFormer achieves this through the use of bag-wise interactions, which allow for the transformation of text to a more appropriate granularity and the incorporation of entity knowledge into the model. Our experiments demonstrate that BagFormer is able to achieve results comparable to state-of-the-art single encoder models in cross-modal retrieval tasks, while also offering efficient training and inference with 20.72 times lower latency and 25.74 times higher throughput.
translated by 谷歌翻译